Statements

Acknowledgement
I sincerely thank my parents and family for giving me the support and opportunity to invest my time in learning Machine Learning and Artificial Intelligence and applying them to environmental management work. Furthermore, I thank the Google Career Certification courses for providing the resources to learn Python programming and machine learning concepts.

Use of generative artificial intelligence
Generative artificial intelligence (GenAI) was mainly used for creating charts and adjusting visualization parameters in Python. GenAI was also used for code debugging. However, the responses provided by GenAI were critically judged before being implemented.

Executive Summary

Problem Statement
Salifort Motors is a fictional French-based alternative energy vehicle manufacturer. The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They have turned to a data analytics professional and asked for data-driven suggestions based on an understanding of the data. They have the following question: what’s likely to make an employee leave the company?

Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will be beneficial to the company. If the data analyst can predict which employees are likely to quit, it might be possible to identify the main factors that contribute to their leaving.

Project Aim and Focus
Goals in this project are to analyze the data collected by the HR department and to build a model that predicts whether or not an employee will leave the company.

Raw data used
This project uses a dataset called HR_capstone_dataset.csv. It contains 10 columns of self-reported information from employees of a fictitious multinational vehicle manufacturing corporation.

Methodology
The following methodology was undertaken for this project:
- Raw data - HR_capstone_dataset.csv from the HR department is used to assess the needs of the senior leadership team.
- The data set is split into 70% training and 30% test data, which are used to train the machine learning models and make predictions.
- Analyses such as the confusion matrix, feature importance and scoring metrics are performed to assess the models' performance in predicting the employee satisfaction levels and the main factors influencing employees to quit.

Results
Out of the models, .

1 Introduction

Salifort Motors is a fictional French-based alternative energy vehicle manufacturer. Its global workforce of over 100,000 employees research, design, construct, validate, and distribute electric, solar, algae, and hydrogen-based vehicles. Salifort’s end-to-end vertical integration model has made it a global leader at the intersection of alternative energy and automobiles.

The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They collected data from employees, but now they don’t know what to do with it. They have turned to a data analytics professional and asked for data-driven suggestions based on an understanding of the data. They have the following question: what’s likely to make an employee leave the company?

Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will be beneficial to the company. If the data analyst can predict which employees are likely to quit, it might be possible to identify the main factors that contribute to their leaving.

2 Aim and Methodology of this Project

For this project, the key stakeholders include the HR department and the senior leadership team, as they are directly involved in employee management and decision-making. The senior leadership team has tasked the data analyst with analyzing the dataset to come up with ideas for how to increase employee retention. To help with this, they would like the analyst to build a machine learning model that predicts whether an employee will leave the company based on their department, number of projects, average monthly hours, and any other data points deemed helpful.

Goals
The primary objective is to identify and predict the underlying drivers contributing to employee turnover, which can help in formulating effective retention strategies. Goals in this project are to analyze the data collected by the HR department and to build a model that predicts whether or not an employee will leave the company.

Methodology
For this project, the analyst chooses a method to approach this data challenge, selecting either a regression model or a tree-based machine learning model to predict whether an employee will leave the company. The following methodology was undertaken for this project:

  • Raw data - HR_capstone_dataset.csv from the HR department is used to assess the needs of the senior leadership team.
  • The data set is split into 70% training and 30% test data, which are used to train the machine learning models and make predictions.
  • Analyses such as the confusion matrix, feature importance and scoring metrics are performed to assess the models' performance in predicting the employee satisfaction levels and the main factors influencing employees to quit.
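
The 70/30 split described in the methodology could look roughly like the sketch below. It runs on a small made-up frame rather than the HR data, and the use of `stratify` and the `random_state` value are assumptions for illustration, not settings confirmed by this report.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the HR data: 100 rows, 24 of them leavers (made-up values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "satisfaction": rng.uniform(0.09, 1.0, size=100).round(2),
    "left": [1] * 24 + [0] * 76,
})

X = toy.drop(columns="left")
y = toy["left"]

# 70% train / 30% test; stratify=y keeps the share of leavers similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 2), round(y_test.mean(), 2))
```

Stratifying is one way to keep the minority class (employees who left) represented in both parts when the classes are imbalanced.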

3 Exploratory Data Analysis - EDA

This project uses a dataset called HR_capstone_dataset.csv, which was downloaded from Kaggle. In the EDA, the dataset is analysed and prepared for building the machine learning models. The analysis includes:

  • Loading the required packages and the data set
  • Checking the descriptive statistics
  • Checking for missing values, duplicate values and outliers
  • Visualizing the relationships within or between the data variables

3.1 Load required libraries

First, the libraries and packages needed for the project are loaded. The selected libraries provide functions for handling data, building machine learning models, and visualizing results.


# Import packages
# Operational Packages
import numpy as np
import pandas as pd
import io
import pickle 

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from tabulate import tabulate

# Modelling packages
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

#XGBoost
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from xgboost import plot_importance

# Modelling evaluation and metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.tree import plot_tree

3.2 Data Loading and Pre-processing

3.2.1 Data Loading and exploration

To start the project, the dataset HR_capstone_dataset.csv is loaded and its basic properties are inspected. The dataset contains 10 columns of self-reported information from employees of a fictitious multinational vehicle manufacturing corporation.

# Load dataset into a dataframe
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Load CSV
df0 = pd.read_csv(r"D:\Study\Machine Learning\Projects\R-Git\Completed projects for GitHub\Predicting-the-employee-satisfaction-levels-at-Salifort-Motors\Data\HR_capstone_dataset.csv")

# Format first 5 rows like a kable table
print(tabulate(df0.head(), headers='keys', tablefmt='latex'))
## \begin{tabular}{rrrrrrrrrll}
## \hline
##     &   satisfaction\_level &   last\_evaluation &   number\_project &   average\_montly\_hours &   time\_spend\_company &   Work\_accident &   left &   promotion\_last\_5years & Department   & salary   \\
## \hline
##   0 &                 0.38 &              0.53 &                2 &                    157 &                    3 &               0 &      1 &                       0 & sales        & low      \\
##   1 &                 0.8  &              0.86 &                5 &                    262 &                    6 &               0 &      1 &                       0 & sales        & medium   \\
##   2 &                 0.11 &              0.88 &                7 &                    272 &                    4 &               0 &      1 &                       0 & sales        & medium   \\
##   3 &                 0.72 &              0.87 &                5 &                    223 &                    5 &               0 &      1 &                       0 & sales        & low      \\
##   4 &                 0.37 &              0.52 &                2 &                    159 &                    3 &               0 &      1 &                       0 & sales        & low      \\
## \hline
## \end{tabular}
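
The absolute Windows path used above only resolves on the author's machine. A minimal sketch of a more portable loader follows. It assumes the CSV sits in a `Data` folder under a project root, as the tail of the original path suggests. The helper name `load_hr_data` is hypothetical, and the demo writes a two-row throwaway file instead of touching the real data.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_hr_data(project_root):
    """Read Data/HR_capstone_dataset.csv relative to a project root (assumed layout)."""
    csv_path = Path(project_root) / "Data" / "HR_capstone_dataset.csv"
    return pd.read_csv(csv_path)


# Demo with a throwaway directory that mimics the assumed layout
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "Data"
    data_dir.mkdir()
    (data_dir / "HR_capstone_dataset.csv").write_text(
        "satisfaction_level,left\n0.38,1\n0.80,1\n"
    )
    demo = load_hr_data(tmp)

print(demo.shape)
```

In the report itself the call would be something like `df0 = load_hr_data(".")`, assuming the document is rendered from the project root.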

In this step, gaining a comprehensive understanding of the data set and preparing it for modelling is essential. This involves reviewing all variables to understand their data types, statistical distributions, and relevance to the target objective.

# Gather basic information about the data
# Create a StringIO buffer
buffer = io.StringIO()

# Capture the output of df.info() into the buffer
df0.info(buf=buffer)

# Get the content from the buffer
info_str = buffer.getvalue()

# Print the content
print(info_str)
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 14999 entries, 0 to 14998
## Data columns (total 10 columns):
##  #   Column                 Non-Null Count  Dtype  
## ---  ------                 --------------  -----  
##  0   satisfaction_level     14999 non-null  float64
##  1   last_evaluation        14999 non-null  float64
##  2   number_project         14999 non-null  int64  
##  3   average_montly_hours   14999 non-null  int64  
##  4   time_spend_company     14999 non-null  int64  
##  5   Work_accident          14999 non-null  int64  
##  6   left                   14999 non-null  int64  
##  7   promotion_last_5years  14999 non-null  int64  
##  8   Department             14999 non-null  object 
##  9   salary                 14999 non-null  object 
## dtypes: float64(2), int64(6), object(2)
## memory usage: 1.1+ MB
# Print the descriptive statistics
print(tabulate(df0.describe(), headers='keys', tablefmt='simple'))
##          satisfaction_level    last_evaluation    number_project    average_montly_hours    time_spend_company    Work_accident          left    promotion_last_5years
## -----  --------------------  -----------------  ----------------  ----------------------  --------------------  ---------------  ------------  -----------------------
## count          14999              14999              14999                    14999                14999           14999         14999                   14999
## mean               0.612834           0.716102           3.80305                201.05                 3.49823         0.14461       0.238083                0.0212681
## std                0.248631           0.171169           1.23259                 49.9431               1.46014         0.351719      0.425924                0.144281
## min                0.09               0.36               2                       96                    2               0             0                       0
## 25%                0.44               0.56               3                      156                    3               0             0                       0
## 50%                0.64               0.72               4                      200                    3               0             0                       0
## 75%                0.82               0.87               5                      245                    4               0             0                       0
## max                1                  1                  7                      310                   10               1             1                       1

The HR_capstone_dataset.csv dataset contains 14,999 rows and 10 columns, of which 2 are floats, 6 are integers and 2 are objects. Upon initial exploration of the data set, most of the variables in the survey data are usable as predictor variables, but certain variables can be engineered for more effective predictions. An ethical consideration at this point is potential bias in the recorded data, which should be kept in mind during the analysis and while interpreting and presenting the results to ensure fairness and accuracy.

The descriptive statistics of the dataset are shown above. Based on these,

  • On average, employees work on ~3.8 projects and ~201 hours per month.
  • Satisfaction levels range from 0.09 to 1 with a mean of ~0.61, while the mean last evaluation score is ~0.72.
  • Most employees have not had a work accident or been promoted in the last 5 years.

3.2.2 Data Cleaning

In this step, the HR_capstone_dataset.csv dataset is cleaned by addressing missing values, removing redundant or duplicate entries, and identifying any anomalies or inconsistencies. Outliers that could potentially distort model performance are also detected and evaluated for appropriate handling. These steps ensure that the dataset is accurate, consistent, and ready for further analysis, laying a solid foundation for building reliable predictive models.

Rename columns

As a data cleaning step, the columns are renamed as needed: column names are standardized to snake_case, misspelled names are corrected, and names are made more concise.

# Display all column names
df0.columns
## Index(['satisfaction_level', 'last_evaluation', 'number_project',
##        'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
##        'promotion_last_5years', 'Department', 'salary'],
##       dtype='object')
# Rename columns as needed (rename returns a new dataframe, so df0 is left untouched)
df = df0.rename(columns={'satisfaction_level':'satisfaction',
                         'last_evaluation':'last_eval',
                         'number_project':'#_projects',
                         'average_montly_hours':'avg_mon_hrs',
                         'time_spend_company':'tenure',
                         'Work_accident':'work_accident',
                         'promotion_last_5years':'promotion_<5yrs',
                         'Department':'department'
                         })


# Display all column names after the update
df.columns
## Index(['satisfaction', 'last_eval', '#_projects', 'avg_mon_hrs', 'tenure',
##        'work_accident', 'left', 'promotion_<5yrs', 'department', 'salary'],
##       dtype='object')
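
The renaming above is done with a hand-written mapping. As a complementary sketch, the mechanical part (lower-casing, replacing unusual characters) can be automated, with a small manual map kept only for misspellings. The helper `to_snake_case` and the corrected name `average_monthly_hours` are illustrative assumptions, not names used elsewhere in this report.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return name.strip("_").lower()


# Empty toy frame with three of the original column names
toy = pd.DataFrame(columns=["Work_accident", "Department", "average_montly_hours"])

# Automatic normalisation first, then a manual map for misspelled names
toy = toy.rename(columns=to_snake_case)
toy = toy.rename(columns={"average_montly_hours": "average_monthly_hours"})

print(list(toy.columns))
```

Names containing characters such as `#` or `<` work with bracket indexing, as used throughout this report, but not with attribute access like `df.tenure`, which is one reason to prefer plain snake_case names.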

Check missing values

Checking for any missing values in the data. There appear to be no missing values in this dataset.

# Check for missing values
df.isnull().sum()
## satisfaction       0
## last_eval          0
## #_projects         0
## avg_mon_hrs        0
## tenure             0
## work_accident      0
## left               0
## promotion_<5yrs    0
## department         0
## salary             0
## dtype: int64

Check duplicates

Checking for any duplicate entries in the data. 3,008 rows are flagged as duplicates. Because several of the 10 columns are continuous, it is very unlikely that two different employees would report identical values in every column, so these rows are most likely duplicated entries rather than legitimate observations. Dropping them will therefore help in making accurate predictions.

# Check for duplicates
df.duplicated().sum()
## np.int64(3008)
# Inspect some rows containing duplicates as needed
print(tabulate(df[df.duplicated()].head(), headers='keys', tablefmt='simple'))
##         satisfaction    last_eval    #_projects    avg_mon_hrs    tenure    work_accident    left    promotion_<5yrs  department    salary
## ----  --------------  -----------  ------------  -------------  --------  ---------------  ------  -----------------  ------------  --------
##  396            0.46         0.57             2            139         3                0       1                  0  sales         low
##  866            0.41         0.46             2            128         3                0       1                  0  accounting    low
## 1317            0.37         0.51             2            127         3                0       1                  0  sales         medium
## 1368            0.41         0.52             2            132         3                0       1                  0  RandD         low
## 1461            0.42         0.53             2            142         3                0       1                  0  sales         low
# Drop duplicates and save resulting dataframe in a new variable as needed
df1 = df.drop_duplicates(keep='first')

# Display first few rows of new dataframe as needed
print(tabulate(df1.head(), headers='keys', tablefmt='simple'))
##       satisfaction    last_eval    #_projects    avg_mon_hrs    tenure    work_accident    left    promotion_<5yrs  department    salary
## --  --------------  -----------  ------------  -------------  --------  ---------------  ------  -----------------  ------------  --------
##  0            0.38         0.53             2            157         3                0       1                  0  sales         low
##  1            0.8          0.86             5            262         6                0       1                  0  sales         medium
##  2            0.11         0.88             7            272         4                0       1                  0  sales         medium
##  3            0.72         0.87             5            223         5                0       1                  0  sales         low
##  4            0.37         0.52             2            159         3                0       1                  0  sales         low
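
The difference between inspecting duplicates and dropping them can be sketched on a toy frame. The values below are made up. `keep=False` shows every member of a duplicated group, while the default `keep='first'` flags only the later repeats, which are the rows that `drop_duplicates` removes.

```python
import pandas as pd

toy = pd.DataFrame({
    "satisfaction": [0.46, 0.41, 0.46, 0.80, 0.41],
    "last_eval":    [0.57, 0.46, 0.57, 0.86, 0.46],
    "avg_mon_hrs":  [139, 128, 139, 262, 128],
    "left":         [1, 1, 1, 0, 1],
})

# keep=False marks every member of a duplicated group, not only the repeats
all_copies = toy[toy.duplicated(keep=False)].sort_values(list(toy.columns))
print(all_copies)

# The default keep='first' marks only the later repeats, i.e. the rows that get dropped
n_repeats = int(toy.duplicated().sum())
deduped = toy.drop_duplicates(keep="first")
print(n_repeats, len(toy), len(deduped))
```

Sorting the `keep=False` view places identical rows next to each other, which makes it easier to judge by eye whether they are true repeats.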

Check outliers

Checking for outliers in the data. Certain types of models are more sensitive to outliers than others, so whether to remove outliers depends on the type of models that will be used in the project.

# Create a boxplot to visualize the distribution of `tenure` and detect any outliers
plt.figure(figsize=(16,6))
sns.boxplot(x=df1['tenure'])
plt.title('Detecting outliers for tenure (Boxplot)', fontsize=15)
# Set tick label sizes after plotting so they apply to the boxplot's own ticks
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.show()

The box plot shows that there are outliers in the tenure column. Checking how many rows contain outliers in the tenure column.

# Determine the number of rows containing outliers
# 25th Percentile for tenure
percentile25 = df1['tenure'].quantile(0.25)

# 75th Percentile for tenure
percentile75 = df1['tenure'].quantile(0.75)

# IQR - Inter Quartile Range
iqr = percentile75 - percentile25

# Limits of the tenure
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print('Lower limit:', lower_limit)
## Lower limit: 1.5
print('Upper limit:', upper_limit)
## Upper limit: 5.5
# Identifying the outliers in 'tenure'
outliers = df1[(df1['tenure'] > upper_limit) | (df1['tenure'] < lower_limit)]

# print the rows containing the outliers
print('Number of rows containing outliers in tenure:', len(outliers))
## Number of rows containing outliers in tenure: 824
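
The IQR rule used above can be wrapped in a small reusable helper so the same bounds can be applied to other columns. This is a sketch on made-up tenure values. The function name `iqr_bounds` and the parameter `k` are illustrative, with `k=1.5` matching the multiplier used in this report.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up tenure values with two long-tenure employees
tenure = pd.Series([2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 6, 10], name="tenure")

lower, upper = iqr_bounds(tenure)
is_outlier = (tenure < lower) | (tenure > upper)

print(lower, upper, int(is_outlier.sum()))
```

Keeping the result as a boolean mask (`is_outlier`) makes it possible to either count the flagged rows or filter them out later without recomputing the limits.
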

3.2.3 Data exploration and visualization

The first step is to understand how many employees left and what percentage of all employees this figure represents.

# Get numbers of people who left vs. stayed
print(df['left'].value_counts())
## left
## 0    11428
## 1     3571
## Name: count, dtype: int64
print()
# Get percentages of people who left vs. stayed
print(df['left'].value_counts(normalize=True))
## left
## 0    0.761917
## 1    0.238083
## Name: proportion, dtype: float64
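
The counts above are computed on `df`, which still contains the duplicate rows (the two counts sum to 14,999), whereas the plots that follow use the deduplicated `df1`. A sketch of how the class balance can shift once duplicates are removed is shown below. It uses a made-up toy frame, so the proportions are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "satisfaction": [0.38, 0.38, 0.38, 0.80, 0.72, 0.65, 0.11, 0.11],
    "left":         [1,    1,    1,    0,    0,    0,    1,    1],
})
toy_dedup = toy.drop_duplicates()

# Side-by-side class proportions before and after dropping duplicates
balance = pd.DataFrame({
    "with_duplicates": toy["left"].value_counts(normalize=True),
    "deduplicated": toy_dedup["left"].value_counts(normalize=True),
}).sort_index()

print(balance.round(3))
```

In this toy example the duplicated rows are all leavers, so removing them lowers the share of leavers. Whether the same holds for the HR data would need to be checked by running the comparison on `df` and `df1`.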

Next, the variables most relevant to the project are examined, and plots are created to visualize the relationships between variables in the data:

  • Correlation heatmap
  • tenure vs satisfaction; distribution of tenure by left
  • #_projects vs avg_mon_hrs
  • distribution of #_projects
  • satisfaction vs salary; satisfaction vs avg_mon_hrs
  • avg_mon_hrs vs last_eval; avg_mon_hrs vs promotion_<5yrs
  • distribution of left by department
# Select only numeric columns
numeric_df = df1.select_dtypes(include=['number'])

# Plot a correlation heatmap
plt.figure(figsize=(16, 9))
heatmap = sns.heatmap(numeric_df.corr(), vmin=-1, vmax=1, annot=True, cmap=sns.color_palette("vlag", as_cmap=True))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':14}, pad=12);
plt.show()

Correlation heatmap

  • Number of projects, average monthly hours, and evaluation scores are more positively correlated with each other than with the remaining variables (>0.1).
  • Whether or not an employee leaves is negatively correlated with their satisfaction level.
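
Alongside the heatmap, the correlations with the target can be read off as a sorted table. The sketch below uses a small made-up frame, so the numbers are illustrative and not taken from the HR data.

```python
import pandas as pd

toy = pd.DataFrame({
    "satisfaction": [0.90, 0.85, 0.80, 0.70, 0.40, 0.35, 0.11, 0.10],
    "avg_mon_hrs":  [160, 170, 180, 165, 150, 145, 280, 290],
    "tenure":       [2, 3, 3, 2, 3, 3, 4, 4],
    "left":         [0, 0, 0, 0, 1, 1, 1, 1],
    "department":   ["sales"] * 8,
})

# Pearson correlation of every numeric column with the target, most negative first
corr_with_left = (
    toy.select_dtypes(include="number")
       .corr()["left"]
       .drop("left")
       .sort_values()
)

print(corr_with_left.round(2))
```

Pearson correlation only captures linear association, so a weak value here does not rule out a non-linear relationship with leaving.
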
# Plots to analyse tenure vs satisfaction; tenure vs left distribution
# Set figure and axes
fig, ax = plt.subplots(1, 2, figsize = (18,6))

# Tenure vs left distribution
tenure_stay = df1[df1['left']==0]['tenure']
tenure_left = df1[df1['left']==1]['tenure']
sns.histplot(data=df1, x='tenure', hue='left', multiple='dodge', shrink=5, ax=ax[0])
ax[0].set_title('Tenure distribution classified by employee who left', fontsize=12)


# Tenure vs Satisfaction
sns.boxplot(data=df1, x='satisfaction', y='tenure', hue='left', orient="h", saturation=0.75, ax=ax[1])
ax[1].legend(loc='upper left', title='Left')
ax[1].invert_yaxis()
ax[1].set_title('Satisfaction vs Tenure', fontsize=12)

plt.show()

Box Plot

  • Satisfaction levels are similar for short-tenure and long-tenure employees.
  • Short-tenure employees who left were highly dissatisfied, while medium-tenure employees who stayed were highly satisfied.
  • Medium-tenure (4-year) employees who left have a very low satisfaction level.

Histogram Plot

The histogram shows that only a few employees stay longer than 5 years. Those who do stay might be employees promoted to higher ranks in the company.

# plot for #_project vs avg_mon_hrs; distribution of #_projects 
fig, ax = plt.subplots(1, 2, figsize = (18,6))

# distribution of #_projects
projects_stay = df1[df1['left']==0]['#_projects']
projects_left = df1[df1['left']==1]['#_projects']
sns.histplot(data=df1, x='#_projects', hue='left', multiple='dodge', shrink=5, ax=ax[0])
ax[0].set_title('No of projects distribution classified by employee who left', fontsize=12)


# #_project vs avg_mon_hrs
sns.boxplot(data=df1, x='avg_mon_hrs', y='#_projects', hue='left', orient="h",saturation=0.75, ax=ax[1])
ax[1].legend(loc='upper left', title='Left')
ax[1].invert_yaxis()
ax[1].set_title('Average monthly hours by No of project', fontsize=12)
plt.show()

Based on the plots,

Histogram

  • Average monthly working hours are typically in the range of 160 - 200 hrs.
  • All employees who worked on 7 projects left. Employees with 6 projects also worked more hours, but the numbers who stayed and left are very similar. The mean hours of these groups are between 250 and 300 hrs, indicating that they are overworked.
  • The optimal number of projects appears to be 3 or 4: in these groups, far fewer employees left than stayed.
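
Observations such as "all employees on 7 projects left" can be checked numerically with a grouped leave rate. The sketch below runs on made-up rows, not the HR data, and the output column names (`employees`, `leavers`, `leave_rate`) are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "#_projects": [2, 2, 3, 3, 3, 4, 4, 6, 6, 7, 7],
    "left":       [1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1],
})

# Head count, number of leavers and share of leavers per project count
summary = (
    toy.groupby("#_projects")["left"]
       .agg(employees="count", leavers="sum", leave_rate="mean")
)

print(summary)
```

Because `left` is coded 0/1, its mean within each group is the share of employees who left, which is easier to compare across groups of different sizes than raw histogram bars.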

Box Plots

Employees who left the company fall into two groups:

  • Those who worked longer hours on more projects. They likely quit because of being overworked.
  • Those who worked the fewest hours on fewer projects. They were either fired or might have given notice to leave the company, so they were assigned fewer projects and worked fewer hours.

# Plots for satisfaction vs salary; satisfaction vs avg_mon_hrs
fig, ax = plt.subplots(1, 2, figsize = (18,6))

# plot for satisfaction vs salary
sns.boxplot(data=df1, x='satisfaction', y='salary', hue='left', 
            orient="h", saturation=0.75, ax=ax[0])
ax[0].invert_yaxis()
ax[0].legend(loc='upper left', title='Left')
ax[0].set_title('Satisfaction vs Salary', fontsize=12)

# Plot for satisfaction vs avg_mon_hrs
sns.scatterplot(data=df1, x='satisfaction', y='avg_mon_hrs', hue='left', alpha=0.4, ax=ax[1])
ax[1].set_title('Satisfaction level by average monthly work hours', fontsize=14)

plt.show()

Based on the plots,

Box plot

Salary is strongly related to the satisfaction level. At the low and medium salary levels, there are very low satisfaction scores and a high number of employees who left the company.

Scatter plot

Employees who worked very long hours have very low satisfaction levels. A second group of leavers, with satisfaction levels below 0.5, worked fewer hours, which might be because they were fired or had given notice to leave the company. This is consistent with the previous box plots.


# Plot for avg_mon_hrs vs last_eval; avg_mon_hrs vs promotion_<5yrs
fig, ax = plt.subplots(1, 2, figsize = (18,6))

# Plot for avg_mon_hrs vs promotion_<5yrs
sns.scatterplot(data=df1, x='avg_mon_hrs', y='promotion_<5yrs', hue='left', ax=ax[0])
ax[0].set_title('Average monthly hours by promotion in the last 5 years', fontsize=12)

# Plot for avg_mon_hrs vs last_eval
sns.scatterplot(data=df1, x='avg_mon_hrs', y='last_eval', hue='left', alpha=0.4, ax=ax[1])
ax[1].set_title('Average monthly hours by evaluation score', fontsize=14)

plt.show()

Based on the plots,

avg_mon_hrs vs promotion_<5yrs

  • The employees who left had worked long hours and had not been promoted.
  • Only a few of the employees who worked long hours were promoted.

avg_mon_hrs vs last_eval

Employees who left fall into two groups:

  • Overworked employees with high evaluation scores
  • Employees who worked fewer hours and had low evaluation scores

Most of the employees work more than the average monthly working-hours range.
# Plot for distribution of employee who left by department
plt.figure(figsize=(11,8))
sns.histplot(data=df1, x='department', hue='left', discrete=1, 
             hue_order=[0, 1], multiple='dodge', shrink=.5)
plt.title('Employees distribution classified by department', fontsize=12)

plt.show()

Sales, Technical and Support are the top three departments by number of employees who left, compared with the other departments.
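
Raw counts of leavers depend on how large each department is, so a rate per department can complement the histogram. The sketch below uses made-up rows to show how the ranking by count and the ranking by rate can differ. The figures are not from the HR data.

```python
import pandas as pd

toy = pd.DataFrame({
    "department": ["sales"] * 6 + ["hr"] * 2 + ["technical"] * 4,
    "left":       [1, 1, 0, 0, 0, 0,   1, 0,   1, 0, 0, 0],
})

# Absolute counts per department (columns: 0 = stayed, 1 = left)
counts = pd.crosstab(toy["department"], toy["left"])

# Row-normalised version: share of stayers/leavers within each department
rates = pd.crosstab(toy["department"], toy["left"], normalize="index")

print(counts)
print(rates.round(2))
```

In this toy example sales has the most leavers in absolute terms, while hr has the highest leave rate. Running the same two tables on `df1` would show whether the large departments also stand out proportionally.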

3.2.4 EDA Outcomes and Insights

The key drivers of employees leaving appear to be:

  • Long working hours
  • High number of projects
  • Not getting a promotion for their effort
  • Low evaluation scores

Many of the employees appear to be burned out from working long hours on a high number of projects without receiving benefits such as a promotion or a higher salary. This points to poor management and company policies that might have to be investigated further.

# paCe: Construct Stage

-   Determine which models are most appropriate
-   Construct the model
-   Confirm model assumptions
-   Evaluate model results to determine how well your model fits the
    data

## Recall model assumptions

**Logistic Regression model assumptions** 
  - Outcome variable is categorical   
  - Observations are independent of each other   
  - No severe multicollinearity among X variables    
  - No extreme outliers   
  - Linear relationship between each X variable and the logit of the outcome variable    
  - Sufficiently large sample size
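
The multicollinearity assumption can be checked with variance inflation factors (VIF). A minimal sketch follows, using the identity that the VIF of each predictor equals the corresponding diagonal element of the inverse correlation matrix. The data are synthetic, the helper name `vif_table` is hypothetical, and the common reading that values well above roughly 5 to 10 signal a problem is a rule of thumb rather than a hard limit.

```python
import numpy as np
import pandas as pd


def vif_table(features):
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF")


# Synthetic predictors: '#_projects' is built to track 'avg_mon_hrs' closely
rng = np.random.default_rng(0)
hours = rng.normal(200, 50, size=500)
toy = pd.DataFrame({
    "avg_mon_hrs": hours,
    "#_projects": hours / 50 + rng.normal(0, 0.3, size=500),
    "satisfaction": rng.uniform(0.09, 1.0, size=500),  # independent of the others
})

vif = vif_table(toy)
print(vif.round(2))
```

Only predictors should be passed in (not the `left` target), and categorical columns need to be encoded numerically first.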



# Model Building, Training and Predictions

# Results and Evaluation

-   Fit a model that predicts the outcome variable using two or more
    independent variables
-   Check model assumptions
-   Evaluate the model

### Identify the type of prediction task.

**Objective -** to analyse whether or not the employee leaves the
company

### Identify the types of models most appropriate for this task.

This dependent variable has categorical values (0 & 1) which involves
binary classification.

Models to use:

-   Logistic regression
-   Tree-based ML model

### Modelling


### Logistic Regression

Binomial logistic regression suits this objective.

**Steps to take for the model**

-   Categorical variables, i.e. department and salary, must be encoded as
    numeric values
-   Department is an unordered category, which can be encoded with dummy
    variables
-   Salary is a hierarchical category, which should be encoded with
    ordinal values (0 - low, 1 - medium, 2 - high)
# Encoding the categorical into numerical 
# Copy the dataframe for the modelling
enc_df = df1.copy()

# Mapping the salary category with ordinal numbers according to hierarchy
salary_map = {'low':0, 'medium':1, 'high':2}

# Replacing the salary categories with their ordinal codes
enc_df['salary'] = enc_df['salary'].map(salary_map)

# Encoding the department with dummy variables
enc_df = pd.get_dummies(enc_df, drop_first=False)

enc_df.head()
##    satisfaction  last_eval  #_projects  avg_mon_hrs  tenure  work_accident  \
## 0          0.38       0.53           2          157       3              0   
## 1          0.80       0.86           5          262       6              0   
## 2          0.11       0.88           7          272       4              0   
## 3          0.72       0.87           5          223       5              0   
## 4          0.37       0.52           2          159       3              0   
## 
##    left  promotion_<5yrs  salary  department_IT  department_RandD  \
## 0     1                0       0          False             False   
## 1     1                0       1          False             False   
## 2     1                0       1          False             False   
## 3     1                0       0          False             False   
## 4     1                0       0          False             False   
## 
##    department_accounting  department_hr  department_management  \
## 0                  False          False                  False   
## 1                  False          False                  False   
## 2                  False          False                  False   
## 3                  False          False                  False   
## 4                  False          False                  False   
## 
##    department_marketing  department_product_mng  department_sales  \
## 0                 False                   False              True   
## 1                 False                   False              True   
## 2                 False                   False              True   
## 3                 False                   False              True   
## 4                 False                   False              True   
## 
##    department_support  department_technical  
## 0               False                 False  
## 1               False                 False  
## 2               False                 False  
## 3               False                 False  
## 4               False                 False
# Removing the outliers in the tenure and saving it in a new dataframe
df_lr = enc_df[(enc_df['tenure'] >= lower_limit) & (enc_df['tenure'] <= upper_limit)]

df_lr.head().reset_index(drop=True)
##    satisfaction  last_eval  #_projects  avg_mon_hrs  tenure  work_accident  \
## 0          0.38       0.53           2          157       3              0   
## 1          0.11       0.88           7          272       4              0   
## 2          0.72       0.87           5          223       5              0   
## 3          0.37       0.52           2          159       3              0   
## 4          0.41       0.50           2          153       3              0   
## 
##    left  promotion_<5yrs  salary  department_IT  department_RandD  \
## 0     1                0       0          False             False   
## 1     1                0       1          False             False   
## 2     1                0       0          False             False   
## 3     1                0       0          False             False   
## 4     1                0       0          False             False   
## 
##    department_accounting  department_hr  department_management  \
## 0                  False          False                  False   
## 1                  False          False                  False   
## 2                  False          False                  False   
## 3                  False          False                  False   
## 4                  False          False                  False   
## 
##    department_marketing  department_product_mng  department_sales  \
## 0                 False                   False              True   
## 1                 False                   False              True   
## 2                 False                   False              True   
## 3                 False                   False              True   
## 4                 False                   False              True   
## 
##    department_support  department_technical  
## 0               False                 False  
## 1               False                 False  
## 2               False                 False  
## 3               False                 False  
## 4               False                 False
df_lr.shape
## (11167, 19)
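The filter above keeps only rows whose tenure lies between `lower_limit` and `upper_limit`, which are defined earlier in the report. As a minimal sketch, assuming those limits come from the common 1.5 × IQR rule (an assumption, not confirmed in this section), they could be computed as follows. The helper `iqr_limits` and the sample tenure values are hypothetical and only illustrate the idea.

```python
from statistics import quantiles


def iqr_limits(values, k=1.5):
    """Return (lower, upper) outlier limits using the k * IQR rule.

    Hypothetical helper: the 1.5 * IQR rule is assumed here, it is not
    taken from the report itself.
    """
    # method="inclusive" interpolates linearly between observed data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up tenure values (years), for illustration only
tenure = [2, 3, 3, 3, 4, 4, 5, 6, 8, 10]

lower_limit, upper_limit = iqr_limits(tenure)

# Same filtering logic as the dataframe mask, applied to a plain list
kept = [t for t in tenure if lower_limit <= t <= upper_limit]

print(lower_limit, upper_limit)
print(kept)
```

With these sample values only the tenure of 10 years falls outside the limits and is dropped.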

4 Building the Models

4.1 Logistic Regression

# Setting the 'y' variable
y = df_lr['left']

# Setting the 'x' variable with dropping the left column
X = df_lr.drop('left', axis=1)
# Split the data into training (75%) and test (25%) dataset 
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, stratify=y, random_state=0)
# Constructing the LogReg model
log_clf = LogisticRegression(random_state=0, max_iter=500)

# Fitting the model
log_clf.fit(X_train,y_train)
LogisticRegression(max_iter=500, random_state=0)
# Use the model for the test dataset
y_pred = log_clf.predict(X_test)

# Constructing a confusion matrix 
# Computing values in the matrix
log_cm = confusion_matrix(y_test, y_pred, labels=log_clf.classes_)

# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix = log_cm, 
                                  display_labels = log_clf.classes_)

# Plot confusion matrix
log_disp.plot(values_format='')
## <sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x0000023FC8E77D60>
# Display plot
plt.show()

The confusion matrix shows:

  • True positives - employees who left and were predicted to leave = 112
  • True negatives - employees who stayed and were predicted to stay = 2193
  • False positives - employees who stayed but were predicted to leave = 128
  • False negatives - employees who left but were predicted to stay = 359

Checking the class imbalance

df_lr['left'].value_counts(normalize=True)
## left
## 0    0.831468
## 1    0.168532
## Name: proportion, dtype: float64

The classes are split roughly 83% (stayed) to 17% (left), so the data is imbalanced. Depending on the model's performance, the data could be resampled to a more balanced split.

# Create classification report for logistic regression model
row_names = ['Predicted would not leave', 'Predicted would leave']
print(classification_report(y_test, y_pred, target_names=row_names))
##                            precision    recall  f1-score   support
## 
## Predicted would not leave       0.86      0.94      0.90      2321
##     Predicted would leave       0.47      0.24      0.32       471
## 
##                  accuracy                           0.83      2792
##                 macro avg       0.66      0.59      0.61      2792
##              weighted avg       0.79      0.83      0.80      2792

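The classification report shows (weighted averages for precision, recall, and F1 score):

  • Precision = 79%
  • Recall = 83%
  • F1 score = 80%
  • Accuracy = 83%

For the class of interest, employees who leave, the scores are much lower: precision 47%, recall 24%, and F1 score 32%.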
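The figures in the classification report can be reproduced from the four confusion-matrix counts listed earlier. The sketch below uses only those counts. The variable names are illustrative, and the majority-class baseline is an added point of comparison, not something computed in the report.

```python
# Counts read from the logistic regression confusion matrix
tp, tn, fp, fn = 112, 2193, 128, 359
total = tp + tn + fp + fn

# Metrics for the positive class (employees who left)
precision_leave = tp / (tp + fp)
recall_leave = tp / (tp + fn)
f1_leave = 2 * precision_leave * recall_leave / (precision_leave + recall_leave)

# Metrics for the negative class (employees who stayed)
precision_stay = tn / (tn + fn)
recall_stay = tn / (tn + fp)

# Support = number of true examples in each class
support_leave = tp + fn
support_stay = tn + fp

# Weighted average = per-class score weighted by class support
weighted_precision = (
    precision_stay * support_stay + precision_leave * support_leave
) / total

accuracy = (tp + tn) / total

# Baseline: always predict the majority class ("stayed")
baseline_accuracy = support_stay / total

print(round(precision_leave, 2), round(recall_leave, 2), round(f1_leave, 2))
print(round(weighted_precision, 2), round(accuracy, 3), round(baseline_accuracy, 3))
```

The accuracy of about 0.826 is slightly below the 0.831 that a model predicting "stayed" for every employee would reach. This is why accuracy and the weighted averages are misleading on this imbalanced dataset, and why the per-class scores for leavers matter more.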

The model scores very poorly on the project's objective, which is predicting the employees who will leave. Tree-based classifiers, Decision Tree and Random Forest, are tried next.

4.2 Tree-based Model - Decision Tree & Random Forest


# Using the enc_df dataframe
# Setting the y variable
y = enc_df['left']

# Setting the X variable
X = enc_df.drop('left',axis=1)

# Split the data into training (75%) and test (25%) dataset 
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, stratify=y, random_state=0)

4.2.1 Decision Tree Model 1

# Instantiate the decision tree model
tree = DecisionTreeClassifier(random_state=0)

# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth':[2, 4, 6, None],
             'min_samples_leaf': [2, 6, 3],
             'min_samples_split': [2, 5,7]
             }

# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# Instantiate GridSearch
dtree1 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
# Fitting the model
dtree1.fit(X_train,y_train)
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'max_depth': [2, 4, 6, None],
                         'min_samples_leaf': [2, 6, 3],
                         'min_samples_split': [2, 5, 7]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
# Check best parameters
dtree1.best_params_
## {'max_depth': 4, 'min_samples_leaf': 6, 'min_samples_split': 2}
# Check best AUC score on CV
dtree1.best_score_
## np.float64(0.9698667651120891)
def make_results(model_name:str, model_object, metric:str):
    '''
    Arguments:
        model_name (string): what you want the model to be called in the output table
        model_object: a fit GridSearchCV object
        metric (string): precision, recall, f1, accuracy, or auc
  
    Returns a pandas df with the F1, recall, precision, accuracy, and auc scores
    for the model with the best mean 'metric' score across all validation folds.  
    '''

    # Create dictionary that maps input metric to actual metric name in GridSearchCV
    metric_dict = {'auc': 'mean_test_roc_auc',
                   'precision': 'mean_test_precision',
                   'recall': 'mean_test_recall',
                   'f1': 'mean_test_f1',
                   'accuracy': 'mean_test_accuracy'
                  }

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(metric) score
    best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]

    # Extract Accuracy, precision, recall, and f1 score from that row
    auc = best_estimator_results.mean_test_roc_auc
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy
  
    # Create table of results
    table = pd.DataFrame()
    table = pd.DataFrame({'model': [model_name],
                          'precision': [precision],
                          'recall': [recall],
                          'F1': [f1],
                          'accuracy': [accuracy],
                          'auc': [auc]
                        })
  
    return table
# Get all CV scores
dtree1_cv_results = make_results('Decision Tree 1 CV', dtree1, 'auc')
dtree1_cv_results
##                 model  precision    recall        F1  accuracy       auc
## 0  Decision Tree 1 CV    0.91449  0.916279  0.915345  0.971867  0.969867

The scores are high, which indicates strong performance, but a decision tree is prone to overfitting. A Random Forest model is fitted next for comparison.
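To get a feel for how much work each grid search does, the number of hyperparameter combinations and cross-validation fits can be counted directly from the grids. This is a rough sketch: the dictionaries are copied from the Decision Tree 1 and Random Forest 1 grids in this report, and `count_candidates` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import product

# Grids copied from the Decision Tree 1 and Random Forest 1 sections
dtree_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [2, 6, 3],
    "min_samples_split": [2, 5, 7],
}
rf_grid = {
    "max_depth": [3, None],
    "max_features": [1.0],
    "max_samples": [0.7, 1.0],
    "min_samples_leaf": [1, 2, 3],
    "min_samples_split": [2, 3],
    "n_estimators": [100],
}
CV_FOLDS = 4  # cv=4 in both grid searches


def count_candidates(grid):
    """Number of hyperparameter combinations in an exhaustive grid."""
    return len(list(product(*grid.values())))


dtree_candidates = count_candidates(dtree_grid)
rf_candidates = count_candidates(rf_grid)

# Each candidate is fitted once per fold
dtree_fits = dtree_candidates * CV_FOLDS
rf_fits = rf_candidates * CV_FOLDS

print(dtree_candidates, dtree_fits)
print(rf_candidates, rf_fits)
```

Because `refit='roc_auc'` is set, each search also performs one additional fit of the best candidate on the full training set after cross-validation.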

4.2.2 Random Forest 1

# Instantiate model
rf = RandomForestClassifier(random_state=0)

# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3, None], 
             'max_features': [1.0],
             'max_samples': [0.7, 1.0],
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3],
             'n_estimators': [100]
             }    

# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# Instantiate GridSearch
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc', n_jobs=-1)
# Fitting the model
rf1.fit(X_train, y_train) 
GridSearchCV(cv=4, estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [3, None], 'max_features': [1.0],
                         'max_samples': [0.7, 1.0],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3], 'n_estimators': [100]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
# Check best params
rf1.best_params_
## {'max_depth': 3, 'max_features': 1.0, 'max_samples': 0.7, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 100}
  
# Check best AUC score on CV
rf1.best_score_
## np.float64(0.9741727239274944)
# Get all CV scores
rf1_cv_results = make_results('Random Forest 1 CV', rf1, 'auc')
results = pd.concat([rf1_cv_results,dtree1_cv_results], axis=0)
results
##                 model  precision    recall        F1  accuracy       auc
## 0  Random Forest 1 CV   0.836573  0.916277  0.874469  0.956299  0.974173
## 0  Decision Tree 1 CV   0.914490  0.916279  0.915345  0.971867  0.969867
                model  precision    recall        F1  accuracy       auc
0  Random Forest 1 CV   0.950023  0.915614  0.932467  0.977983  0.980425
0  Decision Tree 1 CV   0.914490  0.916279  0.915345  0.971867  0.969867

Based on the cross-validation results:

  • The Random Forest model scores higher than the Decision Tree on AUC, the metric used to select the best estimator. Its recall is at most about 0.001 lower, which is negligible.
  • The Random Forest model is therefore carried forward and evaluated on the test set.

def get_scores(model_name:str, model, X_test_data, y_test_data):
    '''
    Generate a table of test scores.

    In: 
        model_name (string):  How you want your model to be named in the output table
        model:                A fit GridSearchCV object
        X_test_data:          numpy array of X_test data
        y_test_data:          numpy array of y_test data

    Out: pandas df of precision, recall, f1, accuracy, and AUC scores for your model
    '''

    preds = model.best_estimator_.predict(X_test_data)

    auc = roc_auc_score(y_test_data, preds)
    accuracy = accuracy_score(y_test_data, preds)
    precision = precision_score(y_test_data, preds)
    recall = recall_score(y_test_data, preds)
    f1 = f1_score(y_test_data, preds)

    table = pd.DataFrame({'model': [model_name],
                          'precision': [precision], 
                          'recall': [recall],
                          'f1': [f1],
                          'accuracy': [accuracy],
                          'AUC': [auc]
                         })
  
    return table
  
# Get predictions on test data
rf1_test_scores = get_scores('Random Forest 1 Test', rf1, X_test, y_test)
rf1_test_scores
##                   model  precision    recall        f1  accuracy       AUC
## 0  Random Forest 1 Test   0.855535  0.915663  0.884578  0.960307  0.942431
                  model  precision    recall        f1  accuracy       AUC
0  Random Forest 1 Test   0.964211  0.919679  0.941418  0.980987  0.956439

The test scores are close to the cross-validation scores, which suggests that the model generalizes well. Because the test set was held out and used only for this final evaluation, performance on new, unseen data should be similar.

The Round 1 models included all the variables as features. For the Round 2 models, feature engineering is used to refine the data and improve the model.

5 Building the Model

5.1 After Feature Engineering

What can be engineered in the dataset:

  • Satisfaction level may not be reported for all employees, so dropping it is an option.
  • Average monthly hours might be a source of data leakage, as it may be recorded after an employee has given notice to resign or the company has given notice of termination. Engineering it into a new binary variable, overworked, might help improve the model's predictions.

# Drop the `satisfaction` column and save the resulting dataframe in a new variable
df2 = enc_df.drop('satisfaction', axis=1)

# Display first few rows of new dataframe
df2.head()
##    last_eval  #_projects  avg_mon_hrs  tenure  work_accident  left  \
## 0       0.53           2          157       3              0     1   
## 1       0.86           5          262       6              0     1   
## 2       0.88           7          272       4              0     1   
## 3       0.87           5          223       5              0     1   
## 4       0.52           2          159       3              0     1   
## 
##    promotion_<5yrs  salary  department_IT  department_RandD  \
## 0                0       0          False             False   
## 1                0       1          False             False   
## 2                0       1          False             False   
## 3                0       0          False             False   
## 4                0       0          False             False   
## 
##    department_accounting  department_hr  department_management  \
## 0                  False          False                  False   
## 1                  False          False                  False   
## 2                  False          False                  False   
## 3                  False          False                  False   
## 4                  False          False                  False   
## 
##    department_marketing  department_product_mng  department_sales  \
## 0                 False                   False              True   
## 1                 False                   False              True   
## 2                 False                   False              True   
## 3                 False                   False              True   
## 4                 False                   False              True   
## 
##    department_support  department_technical  
## 0               False                 False  
## 1               False                 False  
## 2               False                 False  
## 3               False                 False  
## 4               False                 False
# Create `overworked` column. For now, it's identical to average monthly hours.
df2['overworked'] = df2['avg_mon_hrs']

# Inspect max and min average monthly hours values
print('Max hours:', df2['overworked'].max())
## Max hours: 310
print('Min hours:', df2['overworked'].min())
## Min hours: 96
# Define `overworked` as working > 175 hrs/month on average
df2['overworked'] = (df2['overworked'] > 175).astype(int)

# Display first few rows of new column
df2['overworked'].head()
## 0    0
## 1    1
## 2    1
## 3    1
## 4    0
## Name: overworked, dtype: int64

Assuming a 40-hour work week and two weeks of vacation per year, the average working hours per month = 40 hours * 50 weeks / 12 months = 166.67 hours.

Overworked is therefore defined as working more than 175 hours per month on average.
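The arithmetic behind the threshold can be checked with a few lines of plain Python. This sketch uses the report's own assumptions (40 hours per week, 50 working weeks, a cut-off of 175 hours per month) and the first five `avg_mon_hrs` values shown earlier. The constant and function names are illustrative only.

```python
HOURS_PER_WEEK = 40
WORK_WEEKS_PER_YEAR = 50      # 52 weeks minus two weeks of vacation
MONTHS_PER_YEAR = 12
OVERWORKED_THRESHOLD = 175    # hours per month, the cut-off chosen in the report

# Expected monthly hours for a standard full-time schedule
baseline_hours = HOURS_PER_WEEK * WORK_WEEKS_PER_YEAR / MONTHS_PER_YEAR


def is_overworked(avg_monthly_hours):
    """Return 1 if the monthly average exceeds the threshold, else 0."""
    return int(avg_monthly_hours > OVERWORKED_THRESHOLD)


# First five avg_mon_hrs values from the dataframe preview
sample_hours = [157, 262, 272, 223, 159]
flags = [is_overworked(h) for h in sample_hours]

print(round(baseline_hours, 2))
print(flags)
```

The flags for these five rows match the first five values of the `overworked` column printed in the report.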

# Drop the `avg_mon_hrs` column
df2 = df2.drop('avg_mon_hrs', axis=1)

# Display first few rows of resulting dataframe
df2.head()
##    last_eval  #_projects  tenure  work_accident  left  promotion_<5yrs  \
## 0       0.53           2       3              0     1                0   
## 1       0.86           5       6              0     1                0   
## 2       0.88           7       4              0     1                0   
## 3       0.87           5       5              0     1                0   
## 4       0.52           2       3              0     1                0   
## 
##    salary  department_IT  department_RandD  department_accounting  \
## 0       0          False             False                  False   
## 1       1          False             False                  False   
## 2       1          False             False                  False   
## 3       0          False             False                  False   
## 4       0          False             False                  False   
## 
##    department_hr  department_management  department_marketing  \
## 0          False                  False                 False   
## 1          False                  False                 False   
## 2          False                  False                 False   
## 3          False                  False                 False   
## 4          False                  False                 False   
## 
##    department_product_mng  department_sales  department_support  \
## 0                   False              True               False   
## 1                   False              True               False   
## 2                   False              True               False   
## 3                   False              True               False   
## 4                   False              True               False   
## 
##    department_technical  overworked  
## 0                 False           0  
## 1                 False           1  
## 2                 False           1  
## 3                 False           1  
## 4                 False           0
# Isolate the outcome variable
y = df2['left']

# Select the features
X = df2.drop('left', axis=1)

# Create test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

5.1.1 Decision Tree 2

# Instantiate model
tree = DecisionTreeClassifier(random_state=0)

# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth':[4, 6, 8, None],
             'min_samples_leaf': [2, 5, 1],
             'min_samples_split': [2, 4, 6]
             }

# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# Instantiate GridSearch
dtree2 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
 
dtree2.fit(X_train, y_train)
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'max_depth': [4, 6, 8, None],
                         'min_samples_leaf': [2, 5, 1],
                         'min_samples_split': [2, 4, 6]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
# Check best params
dtree2.best_params_
## {'max_depth': 6, 'min_samples_leaf': 2, 'min_samples_split': 6}
# Check best AUC score on CV
dtree2.best_score_
## np.float64(0.9586752505340426)
# Get all CV scores
dtree2_cv_results = make_results('Decision Tree 2 CV', dtree2, 'auc')
results = pd.concat([dtree1_cv_results,dtree2_cv_results,rf1_cv_results], axis=0)
results
##                 model  precision    recall        F1  accuracy       auc
## 0  Decision Tree 1 CV   0.914490  0.916279  0.915345  0.971867  0.969867
## 0  Decision Tree 2 CV   0.856693  0.903553  0.878882  0.958523  0.958675
## 0  Random Forest 1 CV   0.836573  0.916277  0.874469  0.956299  0.974173

5.1.2 Random Forest 2

# Instantiate model
rf = RandomForestClassifier(random_state=0)

# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3, None], 
             'max_features': [1.0],
             'max_samples': [0.7, 1.0],
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3],
             'n_estimators': [100]
             }  

# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# Instantiate GridSearch
rf2 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc', n_jobs=-1)
# Fitting the Model
rf2.fit(X_train, y_train)
GridSearchCV(cv=4, estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [3, None], 'max_features': [1.0],
                         'max_samples': [0.7, 1.0],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3], 'n_estimators': [100]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
# Check best params
rf2.best_params_
## {'max_depth': None, 'max_features': 1.0, 'max_samples': 1.0, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 100}
# Check best AUC score on CV
rf2.best_score_
## np.float64(0.960396763726207)
0.9648100662833985
# Get all CV scores
rf2_cv_results = make_results('Random Forest 2 CV', rf2, 'auc')
results = pd.concat([dtree1_cv_results,dtree2_cv_results,rf1_cv_results,rf2_cv_results], axis=0)
results
##                 model  precision    recall        F1  accuracy       auc
## 0  Decision Tree 1 CV   0.914490  0.916279  0.915345  0.971867  0.969867
## 0  Decision Tree 2 CV   0.856693  0.903553  0.878882  0.958523  0.958675
## 0  Random Forest 1 CV   0.836573  0.916277  0.874469  0.956299  0.974173
## 0  Random Forest 2 CV   0.912536  0.880106  0.895991  0.966085  0.960397
                model  precision    recall        F1  accuracy       auc
0  Decision Tree 1 CV   0.914490  0.916279  0.915345  0.971867  0.969867
0  Decision Tree 2 CV   0.856693  0.903553  0.878882  0.958523  0.958675
0  Random Forest 1 CV   0.950023  0.915614  0.932467  0.977983  0.980425
0  Random Forest 2 CV   0.866758  0.878754  0.872407  0.957411  0.964810

Based on the cross-validation results for the two rounds of Decision Tree and Random Forest models:

  • With ROC-AUC as the deciding metric, the Random Forest model performs best in each round, so it is selected as the winning model and evaluated on the test set.
# Get predictions on test data
rf2_test_scores = get_scores('Random Forest 2 Test', rf2, X_test, y_test)
test_results = pd.concat([rf1_test_scores, rf2_test_scores], axis=0)
test_results
##                   model  precision    recall        f1  accuracy       AUC
## 0  Random Forest 1 Test   0.855535  0.915663  0.884578  0.960307  0.942431
## 0  Random Forest 2 Test   0.901010  0.895582  0.898288  0.966311  0.937991
                  model  precision    recall        f1  accuracy       AUC
0  Random Forest 1 Test   0.964211  0.919679  0.941418  0.980987  0.956439
0  Random Forest 2 Test   0.870406  0.903614  0.886700  0.961641  0.938407

Plotting a Confusion Matrix to visualize the model’s predictions on the test set

# Generate array of values for confusion matrix
preds = rf2.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, preds, labels=rf2.classes_)

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                             display_labels=rf2.classes_)
disp.plot(values_format='');

A perfect model would yield all true negatives and true positives, and no false negatives or false positives.

In this case, the model produces more false positives than false negatives, which means that some employees may be flagged as at risk of quitting or being fired when that is not actually the case. It is still a strong model for predicting the employees who stay.

5.1.3 Plotting the Decision Tree

# Plot the tree
plt.figure(figsize=(85,20))
plot_tree(dtree2.best_estimator_, max_depth=6, fontsize=14, feature_names=X.columns, 
          class_names={0:'stayed', 1:'left'}, filled=True);
plt.show()

5.1.3.1 Decision tree feature importance

# Feature importances
dtree2_importances = pd.DataFrame(dtree2.best_estimator_.feature_importances_, 
                                 columns=['gini_importance'], 
                                 index=X.columns
                                )
dtree2_importances = dtree2_importances.sort_values(by='gini_importance', ascending=False)

# Only extract the features with importances > 0
dtree2_importances = dtree2_importances[dtree2_importances['gini_importance'] != 0]
dtree2_importances
##                       gini_importance
## last_eval                    0.343958
## #_projects                   0.343385
## tenure                       0.215681
## overworked                   0.093498
## department_support           0.001142
## salary                       0.000910
## department_sales             0.000607
## department_technical         0.000418
## work_accident                0.000183
## department_IT                0.000139
## department_marketing         0.000078
sns.barplot(data=dtree2_importances, x="gini_importance", y=dtree2_importances.index, orient='h')
plt.title("Decision Tree: Feature Importances for Employee Leaving", fontsize=12)
plt.ylabel("Feature")
plt.xlabel("Importance")
plt.show()

The feature importance plot for the decision tree model shows that last_eval, #_projects, tenure, and overworked have the highest importance, in that order. These variables are the most helpful in predicting the outcome variable, left.
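To quantify how dominant the top features are, the Gini importances printed above can be summed. The sketch below copies the values from the decision tree output. The dictionary and variable names are illustrative.

```python
# Gini importances copied from the decision tree output above
importances = {
    "last_eval": 0.343958,
    "#_projects": 0.343385,
    "tenure": 0.215681,
    "overworked": 0.093498,
    "department_support": 0.001142,
    "salary": 0.000910,
    "department_sales": 0.000607,
    "department_technical": 0.000418,
    "work_accident": 0.000183,
    "department_IT": 0.000139,
    "department_marketing": 0.000078,
}

# Rank features from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

top4 = ranked[:4]
top4_names = [name for name, _ in top4]
top4_share = sum(value for _, value in top4)

print(top4_names)
print(round(top4_share, 4))
```

The four leading features account for about 99.65% of the total importance, so the department, salary, and work accident variables contribute almost nothing to this tree's splits.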

5.1.3.2 Random forest feature importance

Now, plot the feature importances for the random forest model.

# Get feature importances
feat_impt = rf2.best_estimator_.feature_importances_

# Get indices of top 10 features
ind = np.argpartition(rf2.best_estimator_.feature_importances_, -10)[-10:]

# Get column labels of top 10 features 
feat = X.columns[ind]

# Filter `feat_impt` to consist of top 10 feature importances
feat_impt = feat_impt[ind]

y_df = pd.DataFrame({"Feature":feat,"Importance":feat_impt})
y_sort_df = y_df.sort_values("Importance")
fig = plt.figure()
ax1 = fig.add_subplot(111)

y_sort_df.plot(kind='barh',ax=ax1,x="Feature",y="Importance")

ax1.set_title("Random Forest: Important variables that have an impact in employees leaving", fontsize=12)
ax1.set_ylabel("Feature")
ax1.set_xlabel("Importance")

plt.show()

The feature importance plot for the Random Forest model matches the plot for the Decision Tree model.

Recall evaluation metrics

-   **AUC** is the area under the ROC curve; it's also considered the
    probability that the model ranks a random positive example more
    highly than a random negative example.
-   **Precision** measures the proportion of data points predicted as
    True that are actually True, in other words, the proportion of
    positive predictions that are true positives.
-   **Recall** measures the proportion of data points that are predicted
    as True, out of all the data points that are actually True. In other
    words, it measures the proportion of positives that are correctly
    classified.
-   **Accuracy** measures the proportion of data points that are
    correctly classified.
-   **F1-score** is the harmonic mean of precision and recall.
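The ranking interpretation of AUC given above can be illustrated with a small pure-Python sketch. The function `pairwise_auc` and the sample labels and probabilities are hypothetical and made up for illustration. The example also shows that AUC computed from hard 0/1 predictions can differ from AUC computed from predicted probabilities. The `get_scores` helper in this report passes hard predictions to `roc_auc_score`, which may partly explain why the test AUC is lower than the cross-validation AUC. Passing `predict_proba(...)[:, 1]` instead would be the more usual choice.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# Hard class predictions at the default 0.5 threshold
hard = [int(p >= 0.5) for p in proba]

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, hard)

print(round(auc_from_proba, 4))
print(round(auc_from_labels, 4))
```

With hard labels, the ranking information inside each predicted class is lost, and the result reduces to the average of the true positive rate and the true negative rate.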

6 Results and Evaluation

6.1 Summary of model results

Logistic Regression Model

On the test set, the logistic regression model scored poorly on the class that matters most, employees who leave: precision 47%, recall 24%, and F1 score 32% (weighted averages: precision 79%, recall 83%, F1 score 80%; accuracy 83%).

Tree-based Machine Learning

After the feature engineering, on the test set, Random Forest Model outperformed the decision tree model

Decision Tree model performs with,

Random Forest model performs with,

6.2 Conclusion, Recommendations, Next Steps

From the initial assessment, EDA, and visualizations, employees appear to be overworked, which points to a management problem. The models and their feature importances support this.

The following recommendations could be presented to the stakeholders to improve employee retention:

Next Steps - Put a structured process in place for collecting employees' evaluation and satisfaction scores before they leave the company, since scores recorded after a resignation or termination notice may introduce data leakage. This would help mitigate the issue and improve the model's performance.